Hello and welcome to this documentation, where I use embeddings to guide decision-making for our risk polarization study. In this project, I want to get a feel for which risks to include in the study. This small project was done within the context of a large language model seminar and will therefore be rough around the edges.

Which Risks Are Needed?

Introduction

Our study asks participants how they rate certain risks in terms of perceived riskiness. The “Risk Group” decided to take a more systematic approach to filtering out risks worth investigating. Drawing on several sources, such as the Basel Risk Norms and two separate scoping reviews (one from our Risk Polarization group, the other from Amanda), they listed around 100 risks and labeled them by domain (such as health, finances, political, crime, and nature). To be as efficient as possible, however, we have to choose risks that are worth asking about in the first place, as some risks are more similar to each other than others.

One way to do this is to ask humans to rate or sort the risks into clusters/domains themselves. But since that also takes time and money, this project tries to leverage embeddings to do the clustering and mapping instead. A huge shout-out to the R package embedR and its author Dirk Wulff: working with embeddings in R is made very easy with this package, in addition to the generous pipeline(s) provided on his GitHub page.

This small project aims to solve the problem of choosing the right risks in each domain without redundancies, that is, without including risk items that are too similar within a domain. A second problem is identifying whether the different framings in which one can ask about the dimensions of a risk change how that risk is perceived.

We can solve the first problem with the embeddings: they give us a similarity matrix, which lets us map the risks onto a two-dimensional space. The second problem can be solved by comparing the similarity matrices from the different dimensions with each other after the risks have been segregated into clusters with an unsupervised machine-learning method.
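
As a minimal sketch of the similarity measure used throughout this project, cosine similarity can be computed in a couple of lines of base R. The vectors below are toy values for illustration, not real model output:

```r
# Cosine similarity: dot product of two vectors scaled by their lengths.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Toy 4-dimensional "embeddings" (real embeddings have many more dimensions).
v_flood   <- c(0.2, 0.7, 0.1, 0.4)
v_tsunami <- c(0.3, 0.6, 0.2, 0.5)
v_fraud   <- c(0.9, 0.1, 0.8, 0.0)

cosine_sim(v_flood, v_tsunami)  # close to 1: semantically similar risks
cosine_sim(v_flood, v_fraud)    # smaller: dissimilar risks
```

A value near 1 marks near-redundant item pairs, which is exactly what we want to filter out later.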

This is where I insert my personal Hugging Face token. For security reasons, it is not provided here.

er_set_tokens(huggingface = "YOUR PERSONAL TOKEN FROM HUGGINGFACE")

Method & Results

Risks

risk_dat <- read_xlsx("Risks.xlsx", range = "B1:F21")

#pivot longer (so each risk and their domain are in a column)
##remove the numbers on top of each risk word
risk_dat <- risk_dat %>% 
  pivot_longer(everything(), names_to = "Domain/Label", values_to = "Risk/Items") %>% 
  mutate(`Risk/Items` = str_replace_all(`Risk/Items`, "¹|²|³|⁴|⁵|⁶|⁷|⁸|⁹", ""))

#it seems we only have 99 unique risks instead of 100; some cells also contain two risks.
#I'll just keep the first one, as the paired risks are very similar to begin with.
unique(risk_dat$`Risk/Items`)

#motor vehicles appears twice.
risk_dat[duplicated(risk_dat$`Risk/Items`),2]

dupl <- risk_dat[duplicated(risk_dat$`Risk/Items`),]$`Risk/Items`

risk_dat[risk_dat$`Risk/Items` == dupl,]

#remove duplicated
risk_dat <- risk_dat[!duplicated(risk_dat$`Risk/Items`),]

#remove second risks in an item
risk_dat <- risk_dat %>% 
  mutate(`Risk/Items` = sub("/.*", "", `Risk/Items`),   # drop alternatives after "/"
         `Risk/Items` = sub("\\..*", "", `Risk/Items`), # drop anything after a period
         `Risk/Items`= gsub("[\r\n]", "", `Risk/Items`)) %>% 
   arrange(`Domain/Label`, `Risk/Items`)

Here is a list of risks we are working with:

risk_dat %>% 
   arrange(`Domain/Label`) %>% 
   paged_table(options = list(rownames.print = F,
                           rows.print = 20))

We also want to contextualize the risks. We’ll use the paper by Wilson et al. (2018) as a reference for this.

Their article distinguishes four dimensions: a generic question asking how risky one thinks X is, plus three other ways (or dimensions) of asking about one’s risk perception: affect, probability, and consequences. So let’s do it for all contextualized framings. For each of the three dimensions, we will take the question with the highest factor loading, so each dimension has exactly one question.

risk_dat <- risk_dat %>% 
   mutate(Risk_ID = 1:nrow(risk_dat))

dimension = c("generic", "affect", "probability", "consequences")
risk_grid <- expand_grid(Risks = unique(risk_dat$`Risk/Items`), dimension = dimension)

risk_dat_contextualized <- risk_dat %>% 
   full_join(risk_grid, by = c("Risk/Items" = "Risks")) %>% 
   mutate(`Risk/Items` = case_when(dimension == "generic"  ~ paste0("How risky do you think ", `Risk/Items`, " is?"),
                                   dimension == "affect" ~ paste0("When you think about ", `Risk/Items`, " for a moment, to what extent do you feel worried?"),
                                   dimension == "probability" ~ paste0("How likely is it that ", `Risk/Items`, " will occur this year where you live?"),
                                   dimension == "consequences" ~ paste0("If I did experience ", `Risk/Items`, " it would have a severe effect on me personally")
                                   )) %>% 
   arrange(dimension, `Domain/Label`)

risk_dat_contextualized %>% 
   select(`Risk/Items`, everything()) %>% 
   paged_table(options = list(rownames.print = F,
                              rows.print = 20))

Not every risk word fits the question templates perfectly, but it should suffice for this small project. The Risk ID lets us match the risks back up later, so we can compare all dimensions of a risk (none, generic, affect, probability, or consequences).

For our LLM presentation, we want to include fewer risks so the plots etc. are easier to read; we would then handpick some of the risks among the 99 (commented out below).

# selected_risks <- c("assassin", "bomb", "gun", "drugs", "global warming", "suicide", "vaccine", "war", "climate change", "alcohol", "fraud", "social risk", "ageing of population", "corruption", "propaganda", "environmental threats")
# 
# risk_dat <- risk_dat %>% 
#    filter(`Risk/Items` %in% selected_risks)
# 
# risk_dat_contextualized <- risk_dat_contextualized %>% 
#    filter(Risk_ID %in% risk_dat$Risk_ID)

Using Embeddings to Compare Different Dimensionality Settings

The data has two columns: the names of the domains ("Domain/Label") and the risks ("Risk/Items"), which are the targets of the embedding analysis. Using the er_embed() function, we can embed the items. We will use the default all-mpnet-base-v2 model from Hugging Face, a good lightweight model for embedding analyses.

risk_embed <-  er_embed(text = risk_dat$`Risk/Items`, 
                     api = "huggingface", 
                     model = "sentence-transformers/all-mpnet-base-v2")

We can now see the embeddings for each risk:

risk_embed
## 
## embedR object
## 
## Embedding: 99 objects and 768 dimensions.
## 
## 
## Embedding
## 
##                       [,1]       [,2]         [,3]         [,4]         [,5]
## assassin       0.018385978 0.04980664 -0.004807225  0.035280682 -0.013225548
## assassination  0.040638089 0.02382069  0.010378473  0.016700501 -0.040772166
## bomb          -0.001085747 0.07159345 -0.014801372  0.031823762 -0.004992277
## crime          0.018048594 0.08018764  0.026342748  0.001579845 -0.075590685
## criminal      -0.025894137 0.08807318  0.009723256 -0.009204310 -0.067338876

We will use multidimensional scaling to reduce the dimensions to two, which helps us visualize the result. The following plot is colored by the Risk Group’s domain labels. Looking at the plot, most items sit well within their respective labels, but there are still outliers.
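
In the pipeline, er_project() handles this projection for us. As a hedged, self-contained sketch of what classical multidimensional scaling does, base R’s cmdscale() can project a toy distance matrix into two dimensions (the three points here are made up for illustration):

```r
# Three toy points; a and b are close, c is far from both.
d <- dist(rbind(a = c(0, 0), b = c(0, 1), c = c(5, 5)))

# Classical MDS: find 2-D coordinates whose pairwise distances approximate d.
coords <- cmdscale(d, k = 2)
coords  # a and b land close together, c lands far away
```

The same logic applies to the 768-dimensional embeddings: items with high cosine similarity end up near each other in the 2-D plot.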

risk_reduced_df <-  risk_embed %>%
  er_project() %>%
  er_frame()

risk_table_embed <- risk_dat %>% 
  inner_join(risk_reduced_df, by = c("Risk/Items" = "text"))

risk_table_embed %>% 
  ggplot(aes(dim_1, dim_2, label = `Risk/Items`, col = `Domain/Label`))+
  geom_point(size = 2)+
  geom_text(nudge_y = .01, size = 5.5)+
  theme_minimal()+
   theme(legend.position = "top") +
   labs(y = "Dimension 2",
        x = "Dimension 1",
        title = "Single Risk Items")

Instead of comparing each risk to the others visually, we can calculate the cosine similarity so we also have a numerical representation of how similar they are to each other. This will also help us choose which risks to take, as we don’t want ones that are too similar, because they would be redundant.

risk_cosine <- risk_embed %>% 
  er_compare_vectors(metric = "cosine")


risk_cosine_clean <- crossing(item_i = rownames(risk_cosine), 
           item_j = colnames(risk_cosine)) %>% 
  mutate(cosine = risk_cosine[cbind(item_i, item_j)]) %>% 
  left_join(risk_table_embed %>% select(`Domain/Label`, `Risk/Items`), 
            by = c("item_i" = "Risk/Items")) %>% 
  left_join(risk_table_embed %>% select(`Domain/Label`, `Risk/Items`), 
            by = c("item_j" = "Risk/Items"), suffix = c("_i", "_j"))
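
With a pair table like this in hand, flagging near-redundant items reduces to a threshold filter on the similarity matrix. A self-contained sketch on a toy matrix (the 0.8 cutoff and the item names are arbitrary illustrations, not recommendations):

```r
# Toy cosine-similarity matrix for three risks.
sim <- matrix(c(1.00, 0.92, 0.31,
                0.92, 1.00, 0.28,
                0.31, 0.28, 1.00), nrow = 3,
              dimnames = list(c("murder", "homicide", "inflation"),
                              c("murder", "homicide", "inflation")))

# Keep each pair once (upper triangle) and flag similarities above the cutoff.
pairs <- which(sim > 0.8 & upper.tri(sim), arr.ind = TRUE)
redundant <- data.frame(item_i = rownames(sim)[pairs[, "row"]],
                        item_j = colnames(sim)[pairs[, "col"]],
                        cosine = sim[pairs])
redundant  # one flagged pair: murder / homicide
```

Only one item from each flagged pair would then need to go into the questionnaire.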

Let’s do the same for the contextualized risks!

risk_contextualized_embed <- er_embed(text = risk_dat_contextualized$`Risk/Items`, 
                     api = "huggingface", 
                     model = "sentence-transformers/all-mpnet-base-v2")

risk_reduced_contextualized_df <-  risk_contextualized_embed %>%
  er_project() %>%
  er_frame()

risk_table_contextualized_embed <- risk_dat_contextualized %>% 
  inner_join(risk_reduced_contextualized_df, by = c("Risk/Items" = "text")) %>%
   inner_join(risk_dat, by = "Risk_ID", suffix = c("_context", "_single_risk"))

risk_amount <- nrow(risk_dat)
for (i in seq_along(dimension)) {
   dim <- dimension[i]
   
   assign(x = paste(dim, "reduced_contextualized_df", sep = "_"),
          value = risk_table_contextualized_embed[seq(from = 1 + risk_amount * (i -1), to = risk_amount * i),])
}
risk_table_contextualized_embed %>% 
   ggplot(aes(x = dim_1, y = dim_2, label = dimension, col = `Domain/Label_single_risk`)) +
   geom_point(size = 2) +
   geom_text(nudge_y = .01, size = 5.5) +
   theme_minimal() +
   theme(legend.position = "top") +
   labs(y = "Dimension 2",
        x = "Dimension 1",
        title = "Mapping of different contextualized risks")

We can see from this plot that these different dimensions are in fact different from the perspective of the embeddings. Only the generic and affective framings are more or less the same (at least in these two dimensions).

For those who are interested, here is a dimension-specific visualization for each framing. Be careful of the different axis scales, though! As described above, they do differ between dimensions!

Generic Risk Framing

generic_reduced_contextualized_df %>% 
ggplot(aes(dim_1, dim_2, label = `Risk/Items_single_risk`, col = `Domain/Label_single_risk`)) +
   geom_point(size = 2) +
   geom_text(nudge_y = .01, size = 3) +
   theme_minimal() +
   theme(legend.position = "top") +
   labs(y = "Dimension 2",
        x = "Dimension 1",
        title = "Generic Framing")

Affective Framing

affect_reduced_contextualized_df %>% 
ggplot(aes(dim_1, dim_2, label = `Risk/Items_single_risk`, col = `Domain/Label_single_risk`)) +
   geom_point(size = 2) +
   geom_text(nudge_y = .01, size = 3) +
   theme_minimal() +
   theme(legend.position = "top") +
   labs(y = "Dimension 2",
        x = "Dimension 1",
        title = "Affect Framing")

Probabilistic Framing

probability_reduced_contextualized_df %>% 
ggplot(aes(dim_1, dim_2, label = `Risk/Items_single_risk`, col = `Domain/Label_single_risk`)) +
   geom_point(size = 2) +
   geom_text(nudge_y = .01, size = 3) +
   theme_minimal() +
   theme(legend.position = "top") +
   labs(y = "Dimension 2",
        x = "Dimension 1",
        title = "Probability Framing")

Consequence Framing

consequences_reduced_contextualized_df %>% 
ggplot(aes(dim_1, dim_2, label = `Risk/Items_single_risk`, col = `Domain/Label_single_risk`)) +
   geom_point(size = 2) +
   geom_text(nudge_y = .01, size = 3) +
   theme_minimal() +
   theme(legend.position = "top") +
   labs(y = "Dimension 2",
        x = "Dimension 1",
        title = "Consequence Framing")

Overall Stability of Risks in Different Dimensions

At first glance, our labels still hold up well, even from a contextualized perspective.

We also want to investigate whether the risks are stable across those different dimensions! For this, we can compare each risk with every other risk using the cosine similarity (computed from the embeddings).

risk_cosine_contextualized <- risk_contextualized_embed %>% 
  er_compare_vectors(metric = "cosine")

risk_cosine_contextualized_clean <- crossing(item_i = rownames(risk_cosine_contextualized), 
           item_j = colnames(risk_cosine_contextualized)) %>% 
  mutate(cosine = risk_cosine_contextualized[cbind(item_i, item_j)]) %>% 
  left_join(risk_table_contextualized_embed %>% select(`Domain/Label_context`, `Risk/Items_context`, `Risk/Items_single_risk`, dimension), 
            by = c("item_i" = "Risk/Items_context")) %>% 
  left_join(risk_table_contextualized_embed %>% select(`Domain/Label_context`, `Risk/Items_context`, `Risk/Items_single_risk`, dimension), 
            by = c("item_j" = "Risk/Items_context"), suffix = c("_i", "_j")) %>% 
   filter(item_j != item_i) 
base_cosine_ID <- risk_cosine_clean %>% 
   filter(item_i != item_j) %>% 
   left_join(risk_dat, by = c("item_i" = "Risk/Items")) %>% 
   left_join(risk_dat %>% 
                select(`Risk/Items`, Risk_ID), by = c("item_j" = "Risk/Items"), suffix = c("_i", "_j")) %>% 
   mutate(Risk_ID_i_j = paste(Risk_ID_i, Risk_ID_j)) %>% 
   select(cosine, Risk_ID_i_j)


context_cosine_ID <- risk_cosine_contextualized_clean %>% 
   select(cosine, `Risk/Items_single_risk_i`, `Risk/Items_single_risk_j`, dimension_i, dimension_j) %>% 
   left_join(risk_dat, by = c("Risk/Items_single_risk_i" = "Risk/Items")) %>% 
   left_join(risk_dat %>% 
                select(`Risk/Items`, Risk_ID), by = c("Risk/Items_single_risk_j" = "Risk/Items"), suffix = c("_i", "_j")) %>% 
   select(cosine, Risk_ID_i, Risk_ID_j, dimension_i, dimension_j) %>% 
   filter(dimension_i == dimension_j) %>%
   mutate(Risk_ID_i_j = paste(Risk_ID_i, Risk_ID_j),
          dimension = dimension_i) %>% 
   select(cosine, Risk_ID_i_j, dimension, Risk_ID_i)

joined_cosine <- context_cosine_ID %>% 
   pivot_wider(names_from = dimension,
               values_from = cosine,
               names_prefix = "cosine_") %>% 
   left_join(base_cosine_ID, by = "Risk_ID_i_j")

cor_mat_comparison <- joined_cosine %>% 
   summarize(
      base_gen = cor(cosine, cosine_generic),
      base_aff = cor(cosine, cosine_affect),
      base_prob = cor(cosine, cosine_probability),
      base_cons = cor(cosine, cosine_consequences),
      gen_aff = cor(cosine_generic, cosine_affect),
      gen_prob = cor(cosine_generic, cosine_probability),
      gen_cons = cor(cosine_generic, cosine_consequences),
      affect_prob = cor(cosine_affect, cosine_probability),
      affect_cons = cor(cosine_affect, cosine_consequences),
      prob_cons = cor(cosine_probability, cosine_consequences)
   )

cor_mat_comparison %>% 
   t() %>% 
   as.data.frame() %>% 
   kable(col.names = "Correlations", digits = 3)
Correlations
base_gen 0.754
base_aff 0.728
base_prob 0.638
base_cons 0.682
gen_aff 0.874
gen_prob 0.822
gen_cons 0.837
affect_prob 0.840
affect_cons 0.835
prob_cons 0.781

While these pairwise correlations indicate moderate to strong associations across the different dimensions/framing contexts (meaning that the similarity between risks is not greatly influenced by the context a risk appears in), they are averaged over the whole risk set. Let’s plot some more and see whether there are strong outliers!

joined_cosine %>% 
   select(cosine_generic, contains("cosine")) %>% 
   ggpairs(progress = FALSE, 
           axisLabels = "show",
           columnLabels = c("Generic", "Probability", "Consequences", "Affect", "Single"))+
   theme_minimal()+
   theme(strip.text = element_text(size = 14))

Unique Stability of Risk in Different Dimensions

While these correlations are within their dimension setting, what about the single risks? We want to use risk items that are stable across settings!

We can therefore group by the risk itself and calculate those correlations again (i.e., correlate the two cosine-similarity vectors of each pairing involving that risk). A correlation of 1 means that the distance/cosine similarity stayed the same regardless of the dimension setting.

I therefore calculated a mean score consisting of the correlations over all pairwise context settings. A higher score means that the risk item kept consistent distances to the other risks across all pairwise settings.
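
As a toy illustration of one ingredient of this score: for a single risk, take its cosine similarities to every other risk under two framings and correlate the two vectors. The numbers below are made up for illustration:

```r
# Risk X's cosine similarities to four other risks, under two framings.
sims_generic <- c(0.81, 0.42, 0.67, 0.35)
sims_affect  <- c(0.78, 0.45, 0.63, 0.39)

# Near 1: X keeps its relative distances to the other risks across framings.
cor(sims_generic, sims_affect)
```

The mean score then averages this correlation over all ten framing pairs, which is what the table below reports per risk.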

cor_mat_per_risk <- joined_cosine %>% 
   group_by(Risk_ID_i) %>% 
   summarize(
      base_gen = cor(cosine, cosine_generic),
      base_aff = cor(cosine, cosine_affect),
      base_prob = cor(cosine, cosine_probability),
      base_cons = cor(cosine, cosine_consequences),
      gen_aff = cor(cosine_generic, cosine_affect),
      gen_prob = cor(cosine_generic, cosine_probability),
      gen_cons = cor(cosine_generic, cosine_consequences),
      affect_prob = cor(cosine_affect, cosine_probability),
      affect_cons = cor(cosine_affect, cosine_consequences),
      prob_cons = cor(cosine_probability, cosine_consequences)
   ) %>% 
   rowwise() %>% 
   mutate(mean_score = mean(c(base_gen,
                             base_aff,
                             base_prob,
                             base_cons,
                             gen_aff,
                             gen_prob,
                             gen_cons,
                             affect_prob,
                             affect_cons,
                             prob_cons))) %>% 
   arrange(desc(mean_score)) %>% 
   left_join(risk_dat, by = c("Risk_ID_i" = "Risk_ID"))

cor_mat_per_risk %>% 
   select(- Risk_ID_i) %>% 
   relocate(`Risk/Items`, `Domain/Label`) %>% 
   kable(digits = 3)
Risk/Items Domain/Label base_gen base_aff base_prob base_cons gen_aff gen_prob gen_cons affect_prob affect_cons prob_cons mean_score
greenhouse gases Nature 0.922 0.932 0.879 0.917 0.956 0.923 0.932 0.908 0.949 0.914 0.923
political parties Political 0.905 0.912 0.892 0.887 0.948 0.953 0.933 0.940 0.914 0.925 0.921
political preference Political 0.905 0.919 0.894 0.892 0.939 0.957 0.929 0.936 0.922 0.913 0.921
political partisanship Political 0.908 0.920 0.884 0.895 0.952 0.942 0.934 0.932 0.921 0.904 0.919
political inclinations Political 0.917 0.922 0.869 0.899 0.946 0.939 0.942 0.922 0.927 0.887 0.917
emissions Nature 0.892 0.862 0.860 0.860 0.947 0.920 0.926 0.911 0.909 0.892 0.898
political ideology Political 0.870 0.879 0.840 0.834 0.939 0.930 0.936 0.918 0.913 0.892 0.895
coal power Nature 0.887 0.876 0.851 0.850 0.944 0.902 0.898 0.908 0.924 0.899 0.894
homicide Crime 0.875 0.868 0.843 0.836 0.926 0.905 0.911 0.951 0.926 0.881 0.892
trust government Political 0.893 0.902 0.849 0.886 0.933 0.868 0.905 0.909 0.895 0.865 0.891
natural gas Nature 0.891 0.858 0.829 0.838 0.952 0.892 0.917 0.899 0.920 0.904 0.890
political position Political 0.889 0.872 0.844 0.829 0.929 0.953 0.911 0.919 0.875 0.879 0.890
murder Crime 0.868 0.843 0.810 0.816 0.934 0.894 0.921 0.921 0.925 0.873 0.881
influenza Health 0.873 0.870 0.770 0.877 0.924 0.896 0.910 0.884 0.928 0.872 0.880
party identification Political 0.870 0.869 0.865 0.796 0.924 0.928 0.866 0.904 0.852 0.848 0.872
zika virus Health 0.880 0.870 0.840 0.854 0.875 0.904 0.868 0.877 0.877 0.871 0.872
nuclear power Nature 0.867 0.839 0.830 0.830 0.920 0.881 0.885 0.899 0.890 0.863 0.870
climate change Nature 0.874 0.864 0.752 0.810 0.946 0.886 0.906 0.880 0.919 0.861 0.870
murderer Crime 0.850 0.841 0.802 0.810 0.924 0.921 0.880 0.932 0.863 0.850 0.867
tsunami Nature 0.850 0.830 0.764 0.831 0.919 0.897 0.893 0.902 0.921 0.861 0.867
global warming Nature 0.843 0.818 0.735 0.773 0.952 0.911 0.926 0.894 0.936 0.876 0.866
cyclones Nature 0.856 0.821 0.746 0.850 0.933 0.862 0.934 0.864 0.942 0.848 0.866
liberal Political 0.831 0.829 0.795 0.763 0.929 0.904 0.906 0.922 0.884 0.886 0.865
earthquakes Nature 0.860 0.808 0.771 0.829 0.885 0.876 0.924 0.878 0.924 0.885 0.864
politician Political 0.834 0.818 0.777 0.786 0.927 0.930 0.895 0.913 0.861 0.901 0.864
environmental threats Nature 0.848 0.851 0.768 0.858 0.915 0.856 0.889 0.861 0.912 0.867 0.862
criminal Crime 0.841 0.851 0.768 0.739 0.931 0.890 0.901 0.940 0.889 0.846 0.859
wildfires Nature 0.855 0.824 0.754 0.810 0.903 0.882 0.902 0.875 0.910 0.875 0.859
economic crisis Others (Social/Finance) 0.835 0.848 0.767 0.817 0.920 0.874 0.895 0.859 0.926 0.848 0.859
anarchy Political 0.817 0.772 0.764 0.761 0.935 0.913 0.917 0.912 0.891 0.901 0.858
fascism Political 0.807 0.817 0.786 0.766 0.933 0.888 0.883 0.920 0.900 0.879 0.858
floods Nature 0.846 0.811 0.737 0.824 0.913 0.868 0.906 0.875 0.931 0.861 0.857
recession Others (Social/Finance) 0.818 0.822 0.728 0.787 0.913 0.884 0.902 0.885 0.939 0.878 0.856
assassination Crime 0.856 0.805 0.780 0.809 0.931 0.922 0.862 0.930 0.848 0.802 0.855
hurricanes Nature 0.824 0.770 0.721 0.796 0.911 0.880 0.922 0.880 0.931 0.859 0.849
fracking Nature 0.826 0.789 0.765 0.775 0.912 0.889 0.907 0.879 0.888 0.863 0.849
pollution Nature 0.779 0.791 0.709 0.747 0.934 0.881 0.915 0.891 0.950 0.897 0.849
drought Nature 0.791 0.794 0.699 0.771 0.937 0.890 0.931 0.864 0.933 0.881 0.849
crime Crime 0.820 0.815 0.754 0.762 0.909 0.885 0.881 0.933 0.869 0.842 0.847
low level of criminal punishment Crime 0.823 0.840 0.776 0.797 0.918 0.852 0.872 0.899 0.860 0.824 0.846
covid-19 Health 0.824 0.824 0.743 0.826 0.908 0.849 0.882 0.866 0.890 0.836 0.845
nuclear bomb Crime 0.831 0.775 0.729 0.782 0.906 0.853 0.902 0.896 0.906 0.848 0.843
introducing new product to the market Others (Social/Finance) 0.882 0.836 0.772 0.815 0.891 0.845 0.846 0.849 0.855 0.827 0.842
dictator Political 0.770 0.774 0.722 0.727 0.931 0.909 0.883 0.939 0.881 0.878 0.842
hiv Health 0.850 0.789 0.760 0.776 0.893 0.872 0.867 0.869 0.882 0.842 0.840
gene technology Health 0.849 0.803 0.810 0.782 0.892 0.858 0.853 0.838 0.812 0.843 0.834
diabetes Health 0.803 0.778 0.699 0.740 0.928 0.885 0.874 0.881 0.898 0.845 0.833
health behaviour Political 0.799 0.805 0.757 0.755 0.936 0.843 0.858 0.875 0.892 0.778 0.830
kidnapping Crime 0.853 0.815 0.714 0.716 0.908 0.862 0.844 0.898 0.864 0.804 0.828
conservative Political 0.802 0.830 0.726 0.762 0.897 0.821 0.866 0.872 0.843 0.834 0.825
terrorist Crime 0.804 0.766 0.665 0.771 0.886 0.842 0.887 0.895 0.879 0.845 0.824
climate Nature 0.753 0.784 0.664 0.774 0.929 0.839 0.887 0.842 0.910 0.841 0.822
vaccine Health 0.789 0.682 0.769 0.811 0.858 0.836 0.883 0.795 0.846 0.870 0.814
decreasing turnout Political 0.792 0.819 0.762 0.785 0.896 0.884 0.759 0.843 0.832 0.753 0.812
marketization Others (Social/Finance) 0.843 0.801 0.654 0.677 0.937 0.838 0.821 0.873 0.856 0.795 0.810
assassin Crime 0.828 0.811 0.728 0.752 0.873 0.741 0.838 0.866 0.851 0.785 0.807
cigarettes Health 0.805 0.776 0.690 0.750 0.913 0.847 0.850 0.843 0.827 0.771 0.807
bomb Crime 0.794 0.697 0.690 0.768 0.876 0.842 0.851 0.889 0.818 0.772 0.800
propaganda Political 0.749 0.756 0.667 0.719 0.897 0.816 0.877 0.881 0.849 0.775 0.799
refugee Others (Social/Finance) 0.786 0.726 0.636 0.740 0.890 0.864 0.857 0.848 0.827 0.804 0.798
electric power Nature 0.792 0.750 0.694 0.779 0.920 0.793 0.823 0.832 0.824 0.768 0.798
killing Crime 0.796 0.759 0.678 0.709 0.896 0.841 0.847 0.873 0.820 0.738 0.796
sun protection Health 0.774 0.696 0.646 0.771 0.839 0.816 0.868 0.799 0.854 0.839 0.790
ageing of population Others (Social/Finance) 0.766 0.821 0.662 0.796 0.852 0.845 0.769 0.807 0.831 0.750 0.790
genocide Crime 0.788 0.734 0.697 0.729 0.827 0.808 0.852 0.827 0.816 0.795 0.787
holocaust Crime 0.746 0.746 0.711 0.694 0.866 0.821 0.811 0.845 0.846 0.767 0.785
corruption Political 0.655 0.695 0.597 0.608 0.913 0.895 0.866 0.912 0.857 0.846 0.784
suicide Others (Social/Finance) 0.810 0.745 0.670 0.696 0.900 0.784 0.855 0.818 0.837 0.723 0.784
poison Crime 0.753 0.808 0.606 0.775 0.919 0.762 0.823 0.787 0.869 0.737 0.784
terrorism Crime 0.724 0.687 0.647 0.673 0.884 0.828 0.843 0.870 0.854 0.823 0.783
epidemic Health 0.785 0.742 0.647 0.773 0.860 0.829 0.844 0.736 0.819 0.798 0.783
sexual assault Crime 0.782 0.800 0.715 0.659 0.869 0.798 0.823 0.860 0.808 0.690 0.780
inflation Others (Social/Finance) 0.741 0.683 0.648 0.656 0.876 0.830 0.771 0.872 0.819 0.851 0.775
drowning Health 0.766 0.731 0.618 0.701 0.855 0.805 0.866 0.794 0.837 0.772 0.775
cancer Health 0.661 0.698 0.552 0.648 0.867 0.872 0.879 0.863 0.850 0.817 0.771
drugs Health 0.737 0.729 0.587 0.668 0.917 0.814 0.811 0.829 0.853 0.730 0.767
mental health Health 0.776 0.756 0.624 0.667 0.848 0.827 0.848 0.774 0.807 0.736 0.766
revolution Political 0.642 0.568 0.577 0.585 0.912 0.898 0.856 0.885 0.815 0.892 0.763
globalization Others (Social/Finance) 0.693 0.676 0.504 0.618 0.926 0.860 0.836 0.865 0.833 0.792 0.760
dementia Health 0.718 0.724 0.614 0.586 0.881 0.864 0.801 0.840 0.815 0.749 0.759
cannabis Health 0.762 0.742 0.571 0.701 0.914 0.789 0.815 0.751 0.851 0.673 0.757
health Health 0.751 0.748 0.600 0.696 0.865 0.785 0.806 0.808 0.783 0.645 0.749
consumerism Others (Social/Finance) 0.568 0.633 0.538 0.553 0.902 0.859 0.807 0.853 0.839 0.757 0.731
speeding Crime 0.733 0.665 0.547 0.600 0.857 0.746 0.820 0.736 0.824 0.777 0.730
skiing Others (Social/Finance) 0.761 0.658 0.663 0.668 0.811 0.703 0.749 0.711 0.798 0.741 0.726
fraud Political 0.635 0.649 0.556 0.508 0.822 0.810 0.743 0.856 0.785 0.699 0.706
war Crime 0.684 0.628 0.605 0.596 0.782 0.771 0.681 0.825 0.739 0.704 0.702
pregnancy Health 0.674 0.653 0.510 0.584 0.778 0.790 0.856 0.640 0.723 0.707 0.692
social media Others (Social/Finance) 0.662 0.556 0.513 0.568 0.853 0.774 0.756 0.820 0.662 0.735 0.690
social risk Others (Social/Finance) 0.700 0.719 0.600 0.622 0.804 0.650 0.651 0.780 0.730 0.578 0.683
stress Health 0.711 0.635 0.438 0.643 0.815 0.628 0.789 0.666 0.785 0.651 0.676
alcohol Health 0.719 0.723 0.437 0.658 0.865 0.634 0.746 0.639 0.802 0.494 0.672
unemployment Others (Social/Finance) 0.669 0.619 0.472 0.471 0.801 0.784 0.651 0.813 0.773 0.662 0.671
motor vehicles Others (Social/Finance) 0.679 0.630 0.487 0.581 0.859 0.746 0.642 0.745 0.689 0.575 0.663
farming Nature 0.616 0.481 0.545 0.498 0.853 0.765 0.721 0.740 0.659 0.582 0.646
media Others (Social/Finance) 0.651 0.608 0.405 0.563 0.882 0.641 0.710 0.686 0.588 0.569 0.630
changing values Others (Social/Finance) 0.543 0.582 0.557 0.476 0.690 0.677 0.781 0.633 0.412 0.672 0.602
artificial intelligence Others (Social/Finance) 0.620 0.468 0.272 0.473 0.781 0.651 0.671 0.648 0.675 0.602 0.586
isolation Others (Social/Finance) 0.612 0.550 0.058 0.452 0.837 0.246 0.734 0.328 0.741 0.339 0.490

Let us pick the top-performing risk (i.e., the risk whose distances to the other risks stayed the most consistent, independent of the setting).
In this case, it was greenhouse gases:

top_risk <- cor_mat_per_risk$Risk_ID_i[1]

joined_cosine %>% 
   filter(Risk_ID_i == top_risk) %>% 
   select(cosine_generic, contains("cosine")) %>% 
      ggpairs(progress = FALSE, 
           axisLabels = "show",
           columnLabels = c("Generic", "Probability", "Consequences", "Affect", "Single"))+
   theme_minimal()+
   theme(strip.text = element_text(size = 14))

Comparing Each Risk Separately Between Different Dimensions

The prior approach only compared how stable a risk’s distances to all the other risks were relative to the same risk combinations in the other dimensions. While we do see that most risks keep stable distances to the other risks across the different settings, this was only in comparison with other risks, not as a standalone.

We will address this problem here by comparing each risk with itself across the different dimensions. For example, we will compare “war” in the generic setting with “war” in the affect setting. We will use the 768 dimensions of each risk embedding and correlate them with their counterparts in the other settings.
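
The core of this step can be sketched compactly: if the rows of a matrix hold the same risk embedded under different framings, cor() on the transposed matrix yields all pairwise correlations at once (cor() treats columns as variables, hence the transpose). The 6-dimensional vectors below are toy values, not real embeddings:

```r
# Rows = the same risk under different framings (toy 6-D vectors).
emb <- rbind(single      = c(0.2, 0.5, 0.1, 0.4, 0.3, 0.0),
             affect      = c(0.3, 0.4, 0.2, 0.5, 0.2, 0.1),
             probability = c(0.9, 0.0, 0.8, 0.1, 0.7, 0.2))

# Transpose so each framing becomes a variable, then correlate.
round(cor(t(emb)), 2)
```

The loop below does the same per risk on the full 768-dimensional embeddings, keeping each pair of framings once.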

large_embed <- rbind(risk_embed, risk_contextualized_embed)
pairw_cor_df<- data.frame()

for (i in 1:nrow(risk_dat)){
   small_mat <-
      large_embed[seq(from = i,
                      to = nrow(large_embed),
                      by = nrow(risk_dat)), ] %>% 
      t()
   
   risk_word <- colnames(small_mat)[1]
   
   # blocks of rows are ordered: single risks first, then the contextualized
   # dimensions in alphabetical order (here "general" stands in for "generic")
   colnames(small_mat) <-  c("single", "affect", "consequences", "general", "probability")
   
   small_mat <- small_mat %>% 
      cor() %>% 
      as.data.frame() %>%
      rownames_to_column(var = "dim_1") %>% 
      pivot_longer(cols = 2:6, names_to = "dim_2", values_to = "correlation")
      
   cor_long <- small_mat[small_mat$dim_1 > small_mat$dim_2, ]
   
   cor_long$risk <- risk_word

   pairw_cor_df <- rbind(pairw_cor_df, cor_long)
}

wide_pairw_cor <- pairw_cor_df %>%
   pivot_wider(
      id_cols = risk,
      names_from = c(dim_1, dim_2),
      values_from = correlation
   )
wide_pairw_cor 
## # A tibble: 99 × 11
##    risk      single_affect single_consequences single_general single_probability
##    <chr>             <dbl>               <dbl>          <dbl>              <dbl>
##  1 assassin          0.591               0.510          0.614              0.428
##  2 assassin…         0.570               0.593          0.626              0.505
##  3 bomb              0.507               0.486          0.628              0.485
##  4 crime             0.563               0.519          0.596              0.505
##  5 criminal          0.527               0.504          0.546              0.416
##  6 genocide          0.607               0.649          0.655              0.554
##  7 holocaust         0.594               0.651          0.669              0.573
##  8 homicide          0.509               0.564          0.564              0.509
##  9 kidnappi…         0.615               0.568          0.650              0.521
## 10 killing           0.463               0.431          0.501              0.352
## # ℹ 89 more rows
## # ℹ 6 more variables: consequences_affect <dbl>, general_affect <dbl>,
## #   general_consequences <dbl>, probability_affect <dbl>,
## #   probability_consequences <dbl>, probability_general <dbl>

Again, we now have all the risks and their 10 pairwise correlations. I’ll rank them by the mean of all the correlations, so we can see which risks were the most stable across dimensions.

wide_pairw_cor <- wide_pairw_cor %>%
   rowwise() %>%
   mutate(
      cor_mean = mean(c(
         single_affect,
         single_consequences,
         single_general,
         single_probability,
         consequences_affect,
         general_affect,
         general_consequences,
         probability_affect,
         probability_consequences,
         probability_general
      ))
   ) %>%
   arrange(desc(cor_mean))

wide_pairw_cor %>% 
   relocate(cor_mean, .after = risk) %>% 
   head(50) %>% 
   left_join(risk_dat, by = c("risk" = "Risk/Items")) %>% 
   select(risk,`Domain/Label`, cor_mean ) %>% 
   kable(digits = 3)
| risk | Domain/Label | cor_mean |
|:--------------------------------------|:------------------------|---------:|
| decreasing turnout | Political | 0.734 |
| social risk | Others (Social/Finance) | 0.726 |
| economic crisis | Others (Social/Finance) | 0.715 |
| zika virus | Health | 0.711 |
| low level of criminal punishment | Crime | 0.708 |
| coal power | Nature | 0.699 |
| recession | Others (Social/Finance) | 0.696 |
| gene technology | Health | 0.694 |
| fracking | Nature | 0.693 |
| environmental threats | Nature | 0.689 |
| anarchy | Political | 0.689 |
| wildfires | Nature | 0.689 |
| ageing of population | Others (Social/Finance) | 0.686 |
| cyclones | Nature | 0.685 |
| drought | Nature | 0.683 |
| floods | Nature | 0.682 |
| trust government | Political | 0.678 |
| earthquakes | Nature | 0.674 |
| influenza | Health | 0.673 |
| tsunami | Nature | 0.672 |
| greenhouse gases | Nature | 0.670 |
| epidemic | Health | 0.670 |
| natural gas | Nature | 0.669 |
| fascism | Political | 0.667 |
| introducing new product to the market | Others (Social/Finance) | 0.666 |
| kidnapping | Crime | 0.664 |
| political partisanship | Political | 0.663 |
| emissions | Nature | 0.661 |
| nuclear power | Nature | 0.660 |
| corruption | Political | 0.659 |
| consumerism | Others (Social/Finance) | 0.658 |
| fraud | Political | 0.657 |
| health behaviour | Political | 0.654 |
| refugee | Others (Social/Finance) | 0.653 |
| dementia | Health | 0.650 |
| sun protection | Health | 0.649 |
| political preference | Political | 0.644 |
| speeding | Crime | 0.644 |
| holocaust | Crime | 0.643 |
| genocide | Crime | 0.642 |
| hurricanes | Nature | 0.642 |
| pollution | Nature | 0.641 |
| party identification | Political | 0.639 |
| revolution | Political | 0.638 |
| political inclinations | Political | 0.638 |
| poison | Crime | 0.637 |
| globalization | Others (Social/Finance) | 0.637 |
| vaccine | Health | 0.636 |
| skiing | Others (Social/Finance) | 0.636 |
| covid-19 | Health | 0.634 |
# top_50_list <- wide_pairw_cor %>% 
#    relocate(cor_mean, .after = risk) %>% 
#    head(50) %>% 
#    left_join(risk_dat, by = c("risk" = "Risk/Items")) %>% 
#    select(risk,`Domain/Label`, cor_mean )
# 
# wb <- createWorkbook()  # createWorkbook() takes no data argument
# addWorksheet(wb, sheetName = "Top 50 Risks")
# writeDataTable(wb, "Top 50 Risks", x = top_50_list)
# saveWorkbook(wb, "risk_embedded.xlsx")

How many Domains?

One could ask whether these 5 domains accurately describe our risks. Luckily, thanks to the embeddings, we now have a similarity matrix (cosine similarity). With it, k-means clustering becomes available. We use the single-risk data frame, i.e., the risks embedded without any context/question.
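For intuition, the cosine similarity matrix at the heart of this step can be computed directly from an embedding matrix. A minimal base-R sketch, with a random toy `emb` matrix standing in for the real risk embeddings (embedR provides this out of the box):

```r
# Sketch: cosine similarity matrix from a toy embedding matrix
# (rows = risk items, columns = embedding dimensions; values are random
# stand-ins, not real embeddings)
set.seed(1)
emb <- matrix(rnorm(5 * 4), nrow = 5,
              dimnames = list(paste0("risk_", 1:5), NULL))

norms   <- sqrt(rowSums(emb^2))                  # L2 norm of each row
cos_sim <- (emb %*% t(emb)) / outer(norms, norms)

dim(cos_sim)  # a square items-by-items matrix
```

Each entry is the cosine of the angle between two embedding vectors, so the diagonal is 1 and the matrix is symmetric.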

The following plots are generated with differing numbers of clusters. As a rule of thumb, the higher the pseudo-F statistic (the ratio of the between-cluster sum of squares to the within-cluster sum of squares, each scaled by its degrees of freedom), the better the clustering.
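Instead of re-running a chunk per k by hand, the pseudo-F comparison can be sketched as a small loop. Here `toy` is synthetic two-cluster data standing in for our cosine matrix:

```r
# Sketch: pseudo-F across candidate k on synthetic 2-cluster data
# (`toy` is a stand-in for the cosine similarity matrix, not real data)
set.seed(123)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))

pseudo_f <- function(x, k) {
   cl <- kmeans(x, centers = k, nstart = 10)
   N  <- nrow(x)
   # between-cluster SS over its df, divided by within-cluster SS over its df
   (cl$betweenss / (k - 1)) / (cl$tot.withinss / (N - k))
}

f_by_k <- sapply(2:6, function(k) pseudo_f(toy, k))
names(f_by_k) <- 2:6
f_by_k
```

On this toy data the pseudo-F peaks at the true number of clusters, which is the same logic applied to the risk embeddings below.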

3 Clusters

set.seed(123)

k <- 3
N <- dim(risk_cosine)[1]

cl_risk <- kmeans(risk_cosine, centers = k)

# Pseudo-F: between-cluster vs. within-cluster sum of squares,
# each scaled by its degrees of freedom
Fstat <- (cl_risk$betweenss / (k - 1)) / (cl_risk$tot.withinss / (N - k))

risk_table_embed %>% 
   ggplot(aes(dim_1, dim_2, label = `Risk/Items`, col = factor(cl_risk$cluster))) +
   geom_point(size = 2) +
   geom_text(nudge_y = .01, size = 4) +
   theme_minimal() +
   theme(legend.position = "top") +
   ylab("Dimension 2") +
   xlab("Dimension 1") +
   labs(color = "Risk Clusters") +
   annotate("text", x = 0, y = .22, label = paste0("F-value = ", round(Fstat, 3)), size = 5, color = "black")

4 Clusters

set.seed(123)

k <- 4
N <- dim(risk_cosine)[1]

cl_risk <- kmeans(risk_cosine, centers = k)

Fstat <- (cl_risk$betweenss / (k - 1)) / (cl_risk$tot.withinss / (N - k))

risk_table_embed %>% 
   ggplot(aes(dim_1, dim_2, label = `Risk/Items`, col = factor(cl_risk$cluster))) +
   geom_point(size = 2) +
   geom_text(nudge_y = .01, size = 4) +
   theme_minimal() +
   theme(legend.position = "top") +
   ylab("Dimension 2") +
   xlab("Dimension 1") +
   labs(color = "Risk Clusters") +
   annotate("text", x = 0, y = .22, label = paste0("F-value = ", round(Fstat, 3)), size = 5, color = "black")

5 Clusters

set.seed(123)

k <- 5
N <- dim(risk_cosine)[1]

cl_risk <- kmeans(risk_cosine, centers = k)

Fstat <- (cl_risk$betweenss / (k - 1)) / (cl_risk$tot.withinss / (N - k))

risk_table_embed %>% 
   ggplot(aes(dim_1, dim_2, label = `Risk/Items`, col = factor(cl_risk$cluster))) +
   geom_point(size = 2) +
   geom_text(nudge_y = .01, size = 4) +
   theme_minimal() +
   theme(legend.position = "top") +
   ylab("Dimension 2") +
   xlab("Dimension 1") +
   labs(color = "Risk Clusters") +
   annotate("text", x = 0, y = .22, label = paste0("F-value = ", round(Fstat, 3)), size = 5, color = "black")

The F-values indicate that 4 clusters fit best!

What are the clusters?

Here are the risks, separated into the 4 clusters! Now the only thing left to do is making sense of these clusters…

set.seed(42)

cl_risk <- kmeans(risk_cosine, centers = 4)

risks <- colnames(risk_cosine)
cluster <- cl_risk$cluster

clustered_df <- data.frame(risks, cluster)
rownames(clustered_df) <- NULL

clustered_df_wide <- clustered_df %>%
   pivot_wider(names_from = cluster,
               names_prefix = "cluster_",
               values_from = risks,
               values_fn = list(risks = list))

# Pad the shorter clusters with NAs so all four columns have equal length
# (NA counts are hard-coded to this run's cluster sizes)
cluster_dat <- data.frame(
   unlist(clustered_df_wide$cluster_1),
   c(unlist(clustered_df_wide$cluster_2), rep(NA, 4)),
   c(unlist(clustered_df_wide$cluster_3), rep(NA, 7)),
   c(unlist(clustered_df_wide$cluster_4), rep(NA, 10))
)
colnames(cluster_dat) <- paste0("cluster_", 1:4)

cluster_dat %>%
   kable()
| cluster_1 | cluster_2 | cluster_3 | cluster_4 |
|:----------|:----------|:----------|:----------|
| gene technology | low level of criminal punishment | poison | assassin |
| sun protection | consumerism | speeding | assassination |
| climate | economic crisis | alcohol | bomb |
| climate change | globalization | cancer | crime |
| coal power | marketization | cannabis | criminal |
| cyclones | media | cigarettes | genocide |
| drought | recession | covid-19 | holocaust |
| earthquakes | refugee | dementia | homicide |
| electric power | social risk | diabetes | kidnapping |
| emissions | anarchy | drowning | killing |
| environmental threats | conservative | drugs | murder |
| farming | decreasing turnout | epidemic | murderer |
| floods | dictator | health | nuclear bomb |
| fracking | fascism | hiv | sexual assault |
| global warming | liberal | influenza | terrorism |
| greenhouse gases | party identification | mental health | terrorist |
| hurricanes | political ideology | pregnancy | war |
| natural gas | political inclinations | stress | suicide |
| nuclear power | political parties | vaccine | corruption |
| pollution | political partisanship | zika virus | fraud |
| tsunami | political position | isolation | NA |
| wildfires | political preference | unemployment | NA |
| ageing of population | politician | health behaviour | NA |
| artificial intelligence | propaganda | NA | NA |
| changing values | revolution | NA | NA |
| inflation | trust government | NA | NA |
| introducing new product to the market | NA | NA | NA |
| motor vehicles | NA | NA | NA |
| skiing | NA | NA | NA |
| social media | NA | NA | NA |
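As a side note, the NA padding in the chunk above is hard-coded to this run's cluster sizes and would break if the assignments change. A sketch of a generic alternative, where the hypothetical `clusters` list stands in for the real cluster assignments:

```r
# Sketch: pad clusters to equal length without hard-coding the NA counts
# (`clusters` is a hypothetical stand-in for the real cluster assignments)
clusters <- list(
   cluster_1 = c("floods", "drought", "wildfires"),
   cluster_2 = c("fraud", "corruption"),
   cluster_3 = c("influenza")
)

n_max  <- max(lengths(clusters))
padded <- data.frame(lapply(clusters, function(x) {
   c(x, rep(NA, n_max - length(x)))   # fill the shortfall with NAs
}))
padded
```

Because the padding is computed from `lengths()`, the same code works for any number of clusters and any cluster sizes.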

Discussion

This smaller project was done to make an informed decision when choosing our polarized risks, namely risks that stay stable across different contexts (i.e., their meaning does not change depending on how the risk is framed or how the question was built). We now have a list of risks ranked from stable to unstable.
Likewise, the cluster analysis hinted towards 4 clusters, though we do not yet have good labels to group them by…

Limitations

  • This was a quick-paced project, so there was little room for discussion, as the presentation had to be ready as soon as possible.
  • Likewise, the plots and their labels are not particularly pretty, but they should be sufficient for my goal.
  • While we set a seed for replicability's sake, the numbers change slightly after each knit. This may be caused by the embeddings, which may be updated regularly.

Credits

References

R Packages Used